simon-5507-02-slides

Topics to be covered

  • What you will learn
    • Using variable labels
    • Simple descriptive statistics
    • Printing row with smallest/largest value
    • Missing value logic
    • Simple transformations
    • Histograms
    • Correlations
    • Scatterplots

Review definitions

  • Categorical
    • Small number of possible values
    • Each value associated with a category
  • Continuous
    • Large number of possible values
    • Potentially any value in an interval

Semicolons are important

  • Ends every SAS statement
  • Easy to forget
  • Use this to your advantage
    • Several short lines
    • Indent continuations

Example of stretching statement across multiple lines.

One long line

statement option1 option2 option3 option4;

versus several short lines.

statement
  option1
  option2
  option3
  option4;

Rules for variable names (1/2)

  • Can use mix of
    • letters (A-Z, a-z),
    • numbers (0-9),
    • underscore (_)
  • No blanks, no other symbols

Rules for variable names (2/2)

  • Can’t start with a number
    • “a1” but not “1a”
  • Capitalization not important
    • BMI, Bmi, bmi are same
  • Up to 32 characters in length
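
A minimal sketch of these naming rules. The dataset name name_demo and all values are made up for illustration and are not part of the course data.

```sas
/* Illustrative only: name_demo and its values are hypothetical */
data name_demo;
  a1 = 1;            /* legal: starts with a letter */
  _temp = 2;         /* legal: starts with an underscore */
  fat_brozek = 3;    /* legal: letters plus an underscore */
  BMI = 25;          /* creates a variable named BMI */
  bmi = bmi + 1;     /* same variable as BMI, so it now holds 26 */
  /* 1a = 4;  would fail: the name starts with a number */
run;

proc print
    data=name_demo;
  title1 "Only one bmi column appears";
run;
```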

Recommendations for variable names (1/2)

  • Avoid generic names (x1, var01, etc.)
  • Keep it short
    • Use commonly known abbreviations…
    • …but nothing cryptic
  • Use all lower case (age, not AGE or Age)

Recommendations for variable names (2/2)

  • Separate words with underscores
    • fat_brozek, not fatbrozek
  • Alternative: CamelCase
    • FatBrozek
  • Caution: Writer’s Exchange website
    • www.writersexchange.com

SAS variable labels (1/2)

  • Longer description of a variable
    • Can include blanks, special symbols
    • Internal documentation
    • Labels substituted on some (but not all) output
  • Required in this class (see grading rubric)

SAS variable labels (2/2)

  • Recommendations for variable labels
    • Judicious use of upper and lower case
    • Spell out abbreviations
    • Specify units of measurement
    • Any other important details

Documenting your program. (1)

* 5507-02-simon-continuous-variables.sas
  author: Steve Simon
  date: created 2021-05-30
  purpose: to work with continuous variables
  license: public domain;

* datasets created in this program
    body, original data
    body1, row with ht=29.5 removed
    body2, ht=29.5 converted to missing
    body3, ht_cm calculated;

Specifying file locations (2)

filename rawdata
  "../data/fat.txt";

libname module02
  "../data";

ods pdf file=
  "../results/5507-02-simon-demo.pdf";

Reading in the data (3)

data module02.body;
  infile rawdata;
  input 
    case
    fat_brozek
    fat_siri
    dens
    age
    wt
    ht
    bmi
    ffw
    neck
    chest
    abdomen
    hip
    thigh
    knee
    ankle
    biceps
    forearm
    wrist;

Adding labels (4)

label
    case="Case number"
    fat_brozek="Fat (Brozek's equation)"
    fat_siri="Fat (Siri's equation)"
    dens="Density"
    age="Age (yrs)"
    wt="Weight (lbs)"
    ht="Height (inches)"
    bmi="Body mass index (kg/m^2)"
    ffw="Fat Free Weight (lbs)"
    neck="Neck circumference (cm)"
    chest="Chest circumference (cm)"
    abdomen="Abdomen circumference (cm)"
    hip="Hip circumference (cm)"
    thigh="Thigh circumference (cm)"
    knee="Knee circumference (cm)"
    ankle="Ankle circumference (cm)"
    biceps="Biceps circumference (cm)"
    forearm="Forearm circumference (cm)"
    wrist="Wrist circumference (cm)";
run;

Adding extra information (5)

* Some additional details about this data:

  Brozek's equation is 457/Density - 414.2

  Siri's equation is 495/Density - 450

  Abdomen circumference is measured at the
  umbilicus and level with the iliac crest

  Wrist circumference is distal to the 
  styloid processes;

The footnote subcommand (6)

proc print
    data=module02.body(obs=10);
  var case fat_brozek fat_siri dens age;
  title1 "Ten rows and five columns";
  title2 "of the body data set";
  footnote1 "Created by Steve Simon on &sysdate using SAS &sysver";
run;

Displaying metadata (7)

proc contents
    data=module02.body;
  title1 "Internal description of body dataset";
run;

Live demo, 1

  • data step
    • label subcommand
  • contents procedure

Break #1

  • What you have learned
    • Using variable labels
  • What’s coming next
    • Simple descriptive statistics

Simple descriptive statistics

  • Always look at these first
  • Is mean high, normal, or low?
  • Is data spread out or tight?
    • Zero standard deviation is a red flag
  • Are minimum and maximum reasonable?

Computing simple statistics (8)


proc means
    n mean std min max
    data=module02.body;
  var ht;
  title1 "Descriptive statistics for ht";
  title2 "The mean is normal for adults";
  title3 "The standard deviation shows tightly packed data";
  title4 "The maximum value is reasonable";
  title5 "The minimum is very low";
run;

Live demo, 2

  • means procedure
    • n option
    • mean option
    • std option
    • min option
    • max option

Break #2

  • What you have learned
    • Simple descriptive statistics
  • What’s coming next
    • Printing row with smallest/largest value

Sorting your data

  • Uses the sort procedure
  • Specify the dataset with data=
  • Specify the sorting variable with the by subcommand
    • Use descending keyword to sort in reverse order

Printing row with smallest or largest value

  • Investigate other variables associated with outlier
  • Is the data shifted left or right?
  • Are other values consistent with the outlier?

Printing row with smallest value (9)


proc sort
    data=module02.body;
  by ht;
run;

proc print
    data=module02.body(obs=1);
  title1 "The row with the smallest ht";
  title2 "Note the inconsistency with wt";
run;

Printing row with largest value (10)

proc sort
    data=module02.body;
  by descending ht;
run;

proc print
    data=module02.body(obs=1);
  title1 "The row with the largest ht";
  title2 "This seems quite normal to me";
run;

Live demo, 3

  • sort procedure
    • by subcommand
      • descending option

Break #3

  • What you have learned
    • Printing row with smallest/largest value
  • What’s coming next
    • Missing value logic

What to do with outliers

  • Depends on the context, ask for help!
    • Live with it
    • Delete the entire observation
    • Convert the value to missing

How to handle outliers

  • No option is best in all cases
    • Live with them
    • Remove the entire row
    • Convert outlier to missing
  • Always report clearly

Importing missing values

  • Different coding schemes
    • dot (.) or blank ( ), the SAS standard
    • other symbols (*, ?)
    • NA, the R standard
    • NULL, the SQL standard
    • Extreme numbers (-1, 9, 99, 999)
    • Empty field (nothing between two delimiters)

Advice on importing missing values

  • Read the data dictionary
  • Always ask WHY a value is missing
  • Convert any non-standard missing codes
    • if iq=999 then iq=.;
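
A short sketch of recoding a non-standard missing value code at import time. The dataset iq_demo and its three rows are made up, and 999 is assumed to be the code the data dictionary documents as missing.

```sas
/* Illustrative only: iq_demo and its values are hypothetical */
data iq_demo;
  input id iq;
  if iq = 999 then iq = .;   /* recode the non-standard missing code */
  datalines;
1 102
2 999
3 95
;
run;

proc means
    n nmiss mean
    data=iq_demo;
  var iq;
  title1 "After recoding, 999 no longer inflates the mean";
run;
```

With these three rows, proc means should report two non-missing values and one missing value. Without the recoding, 999 would be averaged in as if it were a real score.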

Missing value logic in SAS

  • Treated as smaller than any number in comparisons and sorts
    • Even smaller than the most negative value, approximately \(-1.8 \times 10^{308}\) (on most computers)
  • Can identify with = . or missing()
    • Differs from R
  • Use caution with less than/greater than comparisons
    • age < 18 will include children AND missing ages
    • use age ^= . & age < 18 instead
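
The comparison trap can be sketched with a tiny made-up dataset. The name age_demo and its three rows are hypothetical.

```sas
/* Illustrative only: age_demo and its values are hypothetical */
data age_demo;
  input id age;
  datalines;
1 12
2 .
3 40
;
run;

proc print
    data=age_demo;
  where age < 18;
  title1 "Wrong way: the missing age in row 2 is printed too";
run;

proc print
    data=age_demo;
  where age ^= . & age < 18;
  title1 "Right way: only row 1 is printed";
run;
```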

Removing a row of data (11)


data module02.body1;
  set module02.body;
  if ht = 29.5 then delete;
run;

Converting outlier to missing (12)

data module02.body2;
  set module02.body;
  if ht=29.5 then ht=.;
run;

Printing negative values (wrong way) (13)

proc print
    data=module02.body2;
  where ht < 0;
  title1 "Printing negative values for ht (wrong way)";
  title2 "Use where ht ^= . & ht < 0 instead";

run;

Counting missing values (14)

proc means
    n nmiss mean std min max
    data=module02.body2;
  var ht;
  title1 "There is one missing value";
run;

Live demo, 4

  • data step
    • set subcommand
    • if … then subcommand
  • print procedure
    • where subcommand
  • means procedure
    • nmiss option

Break #4

  • What you have learned
    • Missing value logic
  • What’s coming next
    • Simple transformations

Transforming values

  • Use data step to create a new variable
    • Unit conversion: temp_c = 5/9 * (temp_f - 32)
    • New variable: bmi = wt_kg / ht_m**2
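
A minimal sketch of both transformations in one data step. The dataset transform_demo, the variable names temp_f, temp_c, wt_kg, and ht_m, and the two data rows are assumptions made for illustration.

```sas
/* Illustrative only: transform_demo and its values are hypothetical */
data transform_demo;
  input temp_f wt_kg ht_m;
  temp_c = 5/9 * (temp_f - 32);   /* Fahrenheit to Celsius */
  bmi = wt_kg / ht_m**2;          /* SAS uses ** for powers */
  datalines;
98.6 70 1.75
212 80 1.80
;
run;

proc print
    data=transform_demo;
  title1 "Original and transformed values side by side";
run;
```

Writing the results to new variables keeps the original values available for checking.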

Non-destructive transformations, different variable name

data name1;
  set name1;
  wt_kg = wt / 2.2;
run;

Non-destructive transformations, different dataset name

data name2;
  set name1;
  wt = wt / 2.2;
run;

Transforming values (15)


data module02.body3;
  set module02.body;
  check_bmi = (wt / 2.2) / (ht / 39.37)**2;
  check_ht = sqrt((wt / 2.2) / bmi) * 39.37;
  check_wt = (bmi * (ht / 39.37)**2) * 2.2;
run;

proc print 
    data=module02.body3;
  var ht check_ht wt check_wt bmi check_bmi;
  where ht=29.5;
  title1 "Recalculating ht, wt, and bmi";
  title2 "Assuming two out of three are correct.";
run;

Live demo, 5

  • No new keywords

Break #5

  • What you have learned
    • Simple transformations
  • What’s coming next
    • Histograms

Historical side note, IBM mainframe printer

Historical side note, Sample output from proc plot

Historical side note, box drawing characters

Box drawing characters

░   ▒   ▓   │   ┤   ╡   ╢   ╖   ╕   ╣   ║   ╗   ╝   ╜   ╛   ┐
└   ┴   ┬   ├   ─   ┼   ╞   ╟   ╚   ╔   ╩   ╦   ╠   ═   ╬   ╧
╨   ╤   ╥   ╙   ╘   ╒   ╓   ╫   ╪   ┘   ┌   █   ▄   ▌   ▐   ▀

Before box drawing characters

| - +

Historical side note, Pen plotter

Historical side note, Dot matrix printer

Today’s technology, Ink jet or laser printer

  • 300 dots per inch or better
  • Up to 16.7 million colors
    • CMYK system versus RGB system

Today’s technology, 3D printers (SAS does not support these)

Drawing histograms

  • Histograms can assess normality/non-normality
    • Skewness
    • Bimodal distributions
    • Outliers
  • How many bars? Multiple recommendations
    • Five to twenty bars
    • Square root of n bars
    • Trial and error
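
One way to try the square root of n recommendation is sketched below. It assumes the module02.body2 dataset from earlier in this program, a hypothetical macro variable named nbars, and the nbins option of the histogram subcommand. SAS may adjust the bins it actually draws, so treat the result as a starting point for trial and error.

```sas
/* Store the rounded-up square root of the row count */
/* in a hypothetical macro variable called nbars     */
data _null_;
  if 0 then set module02.body2 nobs=n;   /* n = number of rows */
  call symputx("nbars", ceil(sqrt(n)));
  stop;
run;

proc sgplot
    data=module02.body2;
  histogram ht / nbins=&nbars;
  title1 "Histogram with about &nbars bars";
run;
```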

Drawing a histogram (default) (16)


proc sgplot
    data=module02.body2;
  histogram ht;
  title1 "Histogram with default bins";
run;

Drawing a histogram (more bars) (17)

proc sgplot
    data=module02.body2;
  histogram ht / binstart=60 binwidth=1;
  title "Histogram with narrow bins";
run;

Drawing a histogram (fewer bars) (18)

proc sgplot
    data=module02.body2;
  histogram ht / binstart=60 binwidth=5;
  title "Histogram with wide bins";
run;

Live demo, 6

  • sgplot procedure
    • histogram subcommand
      • binstart option
      • binwidth option

Break #6

  • What you have learned
    • Histograms
  • What’s coming next
    • Correlations

Correlations

  • Informal interpretation
    • between +0.7 and +1.0: strong positive association
    • between +0.3 and +0.7: weak positive association
    • between -0.3 and +0.3: little or no association
    • between -0.3 and -0.7: weak negative association
    • between -0.7 and -1.0: strong negative association
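
The informal ranges can be sketched as a chain of if/else statements. The dataset interpret_demo and its correlation values are made up. Because the ranges share their endpoints, this sketch assigns a value of exactly 0.3 or 0.7 (in absolute value) to the stronger category, which is an arbitrary choice.

```sas
/* Illustrative only: interpret_demo and its values are hypothetical */
data interpret_demo;
  input r;
  length strength $ 16;
  if missing(r) then strength = "not available";
  else if r >= 0.7 then strength = "strong positive";
  else if r >= 0.3 then strength = "weak positive";
  else if r > -0.3 then strength = "little or none";
  else if r > -0.7 then strength = "weak negative";
  else strength = "strong negative";
  datalines;
0.81
0.45
-0.12
-0.52
-0.90
;
run;

proc print
    data=interpret_demo;
  title1 "Informal interpretation of correlations";
run;
```

The missing() check comes first because a missing correlation would otherwise fall through to the last branch and be labeled strong negative.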

Computing correlations (default) (19)



proc corr
    data=module02.body2
    noprint
    outp=correlations;
  var fat_brozek fat_siri;
  with neck -- wrist;
run;

Processing correlations (20)

data correlations;
  set correlations;
  if _type_ NE "CORR" then delete;
  drop _type_;
  fat_brozek=round(fat_brozek, 0.01);
  fat_siri=round(fat_siri, 0.01);
run;

Sorting the correlations (21)

proc sort
    data=correlations;
  by descending fat_brozek;
run;

proc print 
    data=correlations;
  title1 "Abdomen, hip, and chest show the strongest correlations";
run;

Live demo, 7

  • corr procedure
    • noprint option
    • outp option
    • with subcommand
  • drop subcommand (data step)
  • round function

Break #7

  • What you have learned
    • Correlations
  • What’s coming next
    • Scatterplots

Drawing a scatterplot (22)


proc sgplot
    data=module02.body2;
  scatter x=abdomen y=fat_brozek;
  pbspline x=abdomen y=fat_brozek;
  title1 "Simple scatterplot shows a strong positive trend";
  title2 "It levels off for high values.";
  title3 "This may be due solely to a single outlier on the high end";
run;

Closing the pdf file (23)

ods pdf close;

Summary

  • What you have learned
    • Using variable labels
    • Simple descriptive statistics
    • Printing row with smallest/largest value
    • Missing value logic
    • Simple transformations
    • Histograms
    • Correlations
    • Scatterplots